
[mlir][linalg] Enable scalable vectorization of linalg.unpack #149293


Merged
merged 12 commits into main from users/banach-space/linalg/unpack_vec_update on Aug 6, 2025

Conversation

Contributor

@banach-space banach-space commented Jul 17, 2025

This patch updates vectorizeAsTensorUnpackOp to support scalable
vectorization by requiring user-specified vector sizes for the read operation
(rather than the write operation) in linalg.unpack. The rationale and an
example follow.

Conceptually, linalg.unpack consists of these high-level steps (a sketch of the corresponding IR follows the list):

  • Read from the source tensor using vector.transfer_read.
  • Transpose the read value according to the permutation in the
    linalg.unpack op (via vector.transpose).
  • Re-associate dimensions of the transposed value, as specified by the op
    (via vector.shape_cast).
  • Write the result into the destination tensor via vector.transfer_write.
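
As a rough illustration (a hand-written sketch, not verbatim vectorizer output; masking is omitted and the shapes are borrowed from the `test_vectorize_dynamic_shapes_unpack` test updated by this patch), the emitted IR follows this pattern:

```mlir
func.func @unpack_as_vector_ops(%src: tensor<?x?x16x2xf32>,
                                %dest: tensor<?x?xf32>) -> tensor<?x?xf32> {
  %c0 = arith.constant 0 : index
  %pad = arith.constant 0.0 : f32
  // 1. Read a vector from the source tensor.
  %read = vector.transfer_read %src[%c0, %c0, %c0, %c0], %pad
            : tensor<?x?x16x2xf32>, vector<2x1x16x2xf32>
  // 2. Transpose according to the permutation implied by the unpack op.
  %tr = vector.transpose %read, [0, 3, 1, 2]
          : vector<2x1x16x2xf32> to vector<2x2x1x16xf32>
  // 3. Re-associate (collapse) dimensions.
  %sc = vector.shape_cast %tr : vector<2x2x1x16xf32> to vector<4x16xf32>
  // 4. Write the result into the destination tensor.
  %res = vector.transfer_write %sc, %dest[%c0, %c0]
           : vector<4x16xf32>, tensor<?x?xf32>
  return %res : tensor<?x?xf32>
}
```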

Previously, the vector sizes provided by the user were interpreted as
write-vector sizes (see the example after this list). These were used to:

  • Infer read-vector sizes using the inner_tiles attribute of the unpack op.
  • Deduce vector sizes for the transpose and shape cast operations.
  • Ultimately determine the vector shape for the write.
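
For example, with the test that this patch updates (`inner_dims_pos = [1, 0]`, `inner_tiles = [16, 2]`, source `tensor<?x?x16x2xf32>`), the old contract looked roughly like the sketch below; the inferred read shape is spelled out in the comment:

```mlir
module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%arg0: !transform.any_op {transform.readonly}) {
    %0 = transform.structured.match ops{["linalg.unpack"]} in %arg0 : (!transform.any_op) -> !transform.any_op
    // Old contract: only write-vector-sizes are supplied. With
    // inner_dims_pos = [1, 0] and inner_tiles = [16, 2], the read shape is
    // inferred as [4 / 2, 16 / 16] ++ [16, 2], i.e. vector<2x1x16x2xf32>.
    transform.structured.vectorize %0 vector_sizes [4, 16] : !transform.any_op
    transform.yield
  }
}
```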

However, this logic breaks when one or more tile sizes are dynamic. In such
cases, vectorizeUnPackOpPrecondition fails, and vectorization is rejected.

This patch switches the contract: users now directly specify the
"read-vector-sizes", which inherently encode all inner tile sizes - including
dynamic ones. It becomes the user's responsibility to provide valid sizes.

In practice, since linalg.unpack is typically constructed, tiled, and
vectorized by the same transformation pipeline, the necessary
"read-vector-sizes" should be recoverable.

With this change, we are finally able to vectorise linalg.unpack with
dynamic shapes using scalable vector sizes - see the included tests.
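
For the scalable example shown later in this thread (source `tensor<1x1x8x?xf32>`, `inner_tiles = [8, %tile_size]` with `%tile_size = 8 × vscale`), the new contract would plausibly be exercised as sketched below. This is an illustration consistent with the description above, not copied from the final tests; the exact accepted form is defined by the tests added in this PR:

```mlir
module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%arg0: !transform.any_op {transform.readonly}) {
    %0 = transform.structured.match ops{["linalg.unpack"]} in %arg0 : (!transform.any_op) -> !transform.any_op
    // Read-vector-sizes only: [8] marks the trailing dimension as scalable,
    // which encodes the dynamic inner tile (8 x vscale) directly.
    transform.structured.vectorize %0 vector_sizes [1, 1, 8, [8]] : !transform.any_op
    transform.yield
  }
}
```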

@banach-space banach-space force-pushed the users/banach-space/linalg/unpack_vec_update branch from 0267d2a to 7654b2d on July 21, 2025 09:41
@banach-space banach-space changed the base branch from main to users/banach-space/linalg/group_unpack_tests July 21, 2025 09:41
@banach-space banach-space force-pushed the users/banach-space/linalg/group_unpack_tests branch from ef06440 to 076cde5 on July 21, 2025 10:25
@banach-space banach-space force-pushed the users/banach-space/linalg/unpack_vec_update branch from 7654b2d to 5deba69 on July 21, 2025 10:26
Base automatically changed from users/banach-space/linalg/group_unpack_tests to main July 21, 2025 18:31
@banach-space banach-space force-pushed the users/banach-space/linalg/unpack_vec_update branch from 5deba69 to 953eaf0 on July 24, 2025 09:11
@banach-space banach-space marked this pull request as ready for review July 24, 2025 09:12
Member

llvmbot commented Jul 24, 2025

@llvm/pr-subscribers-mlir
@llvm/pr-subscribers-mlir-vector

@llvm/pr-subscribers-mlir-linalg

Author: Andrzej Warzyński (banach-space)

Changes

This patch updates vectorizeAsTensorUnpackOp to support scalable vectorization by requiring user-specified vector sizes for both the read and write operations involved in linalg.unpack. Detailed rationale and an example are provided below.

Conceptually, linalg.unpack consists of the following high-level steps:

  1. Read from the source tensor.
  2. Transpose the value read in step (1).
  3. Write the value from step (2) into the destination tensor.

Currently, when vectorizing with user-provided vector sizes, only the sizes for the write operation (step 3) are required. Sizes for the read operation (step 1) are inferred from static shapes and inner tile sizes.

This logic breaks when the input shapes or tile sizes are dynamic (indeed, vectorizeUnPackOpPrecondition rejects such cases ATM and the vectorization fails). This patch addresses the issue by requiring explicit vector sizes for both the read and write sides, enabling scalable vectorization in such cases.

Example:

func.func @unpack(%in: tensor<1x1x8x?xf32>, %out: tensor<8x?xf32>) -> tensor<8x?xf32> {
  %vs = vector.vscale
  %c8 = arith.constant 8 : index
  %tile_size = arith.muli %vs, %c8 : index

  %unpack = linalg.unpack  %in
    inner_dims_pos = [0, 1]
    inner_tiles = [8, %tile_size]
    into %out : tensor<1x1x8x?xf32> -> tensor<8x?xf32>
  return %unpack : tensor<8x?xf32>
}

module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%arg0: !transform.any_op {transform.readonly}) {
    %0 = transform.structured.match ops{["linalg.unpack"]} in %arg0 : (!transform.any_op) -> !transform.any_op
    transform.structured.vectorize %0 vector_sizes [1, 1, 8, [8],  8, [8]] : !transform.any_op
    //                                              \         /    \    /
    //                                              read-sizes   write-sizes
    transform.yield
  }
}

Finally, this patch also extends createReadOrMaskedRead and createWriteOrMaskedWrite to take scalable flags.


Patch is 27.05 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/149293.diff

4 Files Affected:

  • (modified) mlir/include/mlir/Dialect/Vector/Utils/VectorUtils.h (+1-1)
  • (modified) mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp (+89-41)
  • (modified) mlir/lib/Dialect/Vector/Utils/VectorUtils.cpp (+12-10)
  • (modified) mlir/test/Dialect/Linalg/vectorization/linalg-ops.mlir (+107-28)
diff --git a/mlir/include/mlir/Dialect/Vector/Utils/VectorUtils.h b/mlir/include/mlir/Dialect/Vector/Utils/VectorUtils.h
index 7cd70e42d363c..8bd54cf31b893 100644
--- a/mlir/include/mlir/Dialect/Vector/Utils/VectorUtils.h
+++ b/mlir/include/mlir/Dialect/Vector/Utils/VectorUtils.h
@@ -228,7 +228,7 @@ bool isLinearizableVector(VectorType type);
 Value createReadOrMaskedRead(OpBuilder &builder, Location loc, Value source,
                              ArrayRef<int64_t> inputVectorSizes, Value padValue,
                              bool useInBoundsInsteadOfMasking = false,
-                             ArrayRef<bool> scalableDims = {});
+                             ArrayRef<bool> inputScalableVecDims = {});
 
 /// Returns success if `inputVectorSizes` is a valid masking configuraion for
 /// given `shape`, i.e., it meets:
diff --git a/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp b/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
index 77c85abab9aa0..78f1c524b69fd 100644
--- a/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
+++ b/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
@@ -1806,7 +1806,8 @@ vectorizeAsTensorPackOp(RewriterBase &rewriter, linalg::PackOp packOp,
     inputShape[innerDimsPos[idx]] *= size;
   auto maskedRead = vector::createReadOrMaskedRead(
       rewriter, loc, packOp.getSource(), inputShape, padValue,
-      useInBoundsInsteadOfMasking);
+      useInBoundsInsteadOfMasking,
+      /*inputScalableVecSizes=*/{});
 
   // Create ShapeCastOp.
   SmallVector<int64_t> destShape(inputVectorSizes);
@@ -1832,18 +1833,23 @@ vectorizeAsTensorPackOp(RewriterBase &rewriter, linalg::PackOp packOp,
   return success();
 }
 
-/// Vectorize a `linalg::UnPackOp` to these 4 Ops:
-///   Vector::TransferReadOp - Reads a vector from the source tensor
-///   vector::TransposeOp - Transpose the Source tensor
-///   ShapeCastOp - Reshape the data based on the target.
-///   vector::TransferWriteOp. - Write the result vector back to the destination
-///   tensor.
-///   If the vector sizes are not provided:
+/// Vectorize `linalg.unpack %src into %dest` as:
+///   // Reads a vector from the source tensor
+///   %read = vector.transfer_read  %src
+///   // Transpose %read as specified in `outer_dims_perm` attribute
+///   %tr = vector.transpose %read
+///   // Reshape the data based on the target
+///   %sc = vector.shape_cast %tr
+///   // Write the result vector to the destination tensor.
+///   vector.transfer_write %sc into %dest
+///
+///  If the vector sizes are not provided:
 ///   * the vector sizes are determined by the input operand and attributes,
 ///   * update the inBounds attribute instead of masking.
 static LogicalResult
 vectorizeAsTensorUnpackOp(RewriterBase &rewriter, linalg::UnPackOp unpackOp,
                           ArrayRef<int64_t> inputVectorSizes,
+                          ArrayRef<bool> inputScalableVecDims,
                           SmallVectorImpl<Value> &newResults) {
 
   // TODO: Introduce a parent class that will handle the insertion point update.
@@ -1860,25 +1866,54 @@ vectorizeAsTensorUnpackOp(RewriterBase &rewriter, linalg::UnPackOp unpackOp,
 
   auto destSize = unpackOp.getDestRank();
 
-  if (!inputVectorSizes.empty())
-    assert(inputVectorSizes.size() == destSize &&
+  if (!inputVectorSizes.empty()) {
+    assert(inputVectorSizes.size() == destSize + sourceShape.size() &&
            "Incorrect number of input vector sizes");
+  }
+
+  SmallVector<bool> readScalableVectorFlags;
+  SmallVector<bool> writeScalableVectorFlags;
+  SmallVector<int64_t> readVectorSizes;
+  SmallVector<int64_t> writeVectorSizes;
 
-  // vectorSizes is the shape of the vector that will be used to do final
+  // Split input-vector-sizes into vector sizes for the read and write
+  // operations.
+  if (!inputVectorSizes.empty()) {
+    readVectorSizes.append(inputVectorSizes.begin(),
+                           inputVectorSizes.begin() + sourceShape.size());
+    writeVectorSizes.append(inputVectorSizes.begin() + sourceShape.size(),
+                            inputVectorSizes.end());
+  }
+  if (!inputScalableVecDims.empty()) {
+    readScalableVectorFlags.append(inputScalableVecDims.begin(),
+                                   inputScalableVecDims.begin() +
+                                       sourceShape.size());
+    writeScalableVectorFlags.append(inputScalableVecDims.begin() +
+                                        sourceShape.size(),
+                                    inputScalableVecDims.end());
+  } else {
+    readScalableVectorFlags = SmallVector<bool>(sourceShape.size(), false);
+    writeScalableVectorFlags = SmallVector<bool>(destSize, false);
+  }
+
+  // writeVectorSizes is the shape of the vector that will be used to do final
   // write on the destination tensor. It is set like this: Let's say the
   // source tensor is rank 'M' and the dest tensor rank 'N', where N <= M.
   // Thus:
-  // 1. vectorSizes = sourceShape.take_front(N)
-  // 2. if outer_dims_perms is present: do that permutation on vectorSizes.
+  // 1. writeVectorSizes = sourceShape.take_front(N)
+  // 2. if outer_dims_perms is present: do that permutation on writeVectorSizes.
   // 3. multiply all the locations in vectorSize pointed by innerDimPos by the
   //    innerTiles attribute value.
-  SmallVector<int64_t> vectorSizes(inputVectorSizes);
-  if (vectorSizes.empty()) {
-    llvm::append_range(vectorSizes, sourceShape.take_front(destSize));
+  // SmallVector<int64_t> writeVectorSizes(inputVectorSizes);
+  if (writeVectorSizes.empty()) {
+    if (ShapedType::isDynamicShape(sourceShape))
+      return failure();
+
+    llvm::append_range(writeVectorSizes, sourceShape.take_front(destSize));
     if (!outerDimsPerm.empty())
-      applyPermutationToVector(vectorSizes, outerDimsPerm);
+      applyPermutationToVector(writeVectorSizes, outerDimsPerm);
     for (auto [i, pos] : llvm::enumerate(innerDimPos))
-      vectorSizes[pos] *= innerTiles[i];
+      writeVectorSizes[pos] *= innerTiles[i];
 
     useInBoundsInsteadOfMasking = true;
   }
@@ -1902,17 +1937,20 @@ vectorizeAsTensorUnpackOp(RewriterBase &rewriter, linalg::UnPackOp unpackOp,
   //   After applying outer_dims_perm: [8, 16]
   //   After appending the rest of the sourceShape: [8, 16, 32, 16]
 
-  SmallVector<int64_t> readVectorSizes(vectorSizes.begin(), vectorSizes.end());
-
-  for (auto [index, size] : enumerate(innerTiles)) {
-    readVectorSizes[innerDimPos[index]] =
-        llvm::divideCeil(readVectorSizes[innerDimPos[index]], size);
-  }
-  if (!outerDimsPerm.empty()) {
-    applyPermutationToVector(readVectorSizes, outerDimsPerm);
+  if (readVectorSizes.empty()) {
+    // Compute read-vector-sizes based on the write-vector-sizes and inner tile
+    // sizes. Note, this will only work when all sizes are static.
+    readVectorSizes = writeVectorSizes;
+    for (auto [index, size] : enumerate(innerTiles)) {
+      readVectorSizes[innerDimPos[index]] =
+          llvm::divideCeil(readVectorSizes[innerDimPos[index]], size);
+    }
+    if (!outerDimsPerm.empty()) {
+      applyPermutationToVector(readVectorSizes, outerDimsPerm);
+    }
+    readVectorSizes.append(sourceShape.begin() + writeVectorSizes.size(),
+                           sourceShape.end());
   }
-  readVectorSizes.append(sourceShape.begin() + vectorSizes.size(),
-                         sourceShape.end());
 
   ReifiedRankedShapedTypeDims reifiedRetShapes;
   LogicalResult status =
@@ -1931,7 +1969,7 @@ vectorizeAsTensorUnpackOp(RewriterBase &rewriter, linalg::UnPackOp unpackOp,
   // to shape of source, then a mask is necessary.
   Value readResult = vector::createReadOrMaskedRead(
       rewriter, loc, unpackOp.getSource(), readVectorSizes, padValue,
-      /*useInBoundsInsteadOfMasking=*/false);
+      /*useInBoundsInsteadOfMasking=*/false, readScalableVectorFlags);
 
   PackingMetadata packMetadata;
   SmallVector<int64_t> lastDimToInsertPosPerm =
@@ -1950,15 +1988,17 @@ vectorizeAsTensorUnpackOp(RewriterBase &rewriter, linalg::UnPackOp unpackOp,
   RankedTensorType collapsedType = tensor::CollapseShapeOp::inferCollapsedType(
       stripMineTensorType, packMetadata.reassociations);
   mlir::VectorType vecCollapsedType =
-      VectorType::get(collapsedType.getShape(), collapsedType.getElementType());
+      VectorType::get(collapsedType.getShape(), collapsedType.getElementType(),
+                      writeScalableVectorFlags);
   vector::ShapeCastOp shapeCastOp = rewriter.create<vector::ShapeCastOp>(
       loc, vecCollapsedType, transposeOp->getResult(0));
 
-  // writeVectorSizes had to match the shapecast shape for dynamic sizes,
+  // writeVectorSizesFinal had to match the shapecast shape for dynamic sizes,
   // otherwise the validator complains that the mask size is invalid.
-  SmallVector<int64_t> writeVectorSizes(
+  // FIXME: We should not override write-vector-sizes like this.
+  SmallVector<int64_t> writeVectorSizesFinal(
       unpackOp.getDestType().hasStaticShape()
-          ? vectorSizes
+          ? writeVectorSizes
           : shapeCastOp.getResultVectorType().getShape());
   Operation *write = createWriteOrMaskedWrite(
       rewriter, loc, shapeCastOp.getResult(), unpackOp.getDest(),
@@ -1989,7 +2029,7 @@ vectorizeAsTensorPadOp(RewriterBase &rewriter, tensor::PadOp padOp,
   assert(succeeded(status) && "failed to reify result shapes");
   auto maskedRead = vector::createReadOrMaskedRead(
       rewriter, loc, padOp.getSource(), inputVectorSizes, padValue,
-      /*useInBoundsInsteadOfMasking=*/false);
+      /*useInBoundsInsteadOfMasking=*/false, /*inputScalableVecSizes=*/{});
 
   // Create Xfer write Op
   Value dest = rewriter.create<tensor::EmptyOp>(
@@ -2073,6 +2113,9 @@ static LogicalResult
 vectorizeUnPackOpPrecondition(linalg::UnPackOp unpackOp,
                               ArrayRef<int64_t> inputVectorSizes) {
 
+  // FIXME!!!
+  return success();
+
   if (llvm::any_of(unpackOp.getInnerTiles(), [](OpFoldResult res) {
         return !getConstantIntValue(res).has_value();
       })) {
@@ -2409,6 +2452,7 @@ vectorizePackOpPrecondition(linalg::PackOp packOp,
     LDBG("pad value is not constant: " << packOp << "\n");
     return failure();
   }
+
   ArrayRef<int64_t> resultTensorShape = packOp.getDestType().getShape();
   bool satisfyEmptyCond = true;
   if (inputVectorSizes.empty()) {
@@ -2487,12 +2531,14 @@ vectorizeScalableVectorPrecondition(Operation *op,
   if (numOfScalableDims == 0)
     return success();
 
+  // TODO: Check the following!
   auto linalgOp = dyn_cast<LinalgOp>(op);
 
-  // Cond 1: There's been no need for scalable vectorisation of
-  // non-linalg Ops so far
-  if (!linalgOp)
-    return failure();
+  // Cond 1: Reject Ops that don't implement the LinalgOp interface, with the
+  // exception of UnpackOp for which there is a dedicated hook.
+  if (!linalgOp) {
+    return isa<linalg::UnPackOp>(op) ? success() : failure();
+  }
 
   // Cond 2: There's been no need for more than 2 scalable dims so far
   if (numOfScalableDims > 2)
@@ -2588,7 +2634,7 @@ vectorizeScalableVectorPrecondition(Operation *op,
                  isa<linalg::MatmulTransposeAOp>(op) ||
                  isa<linalg::DepthwiseConv1DNwcWcOp>(op) ||
                  isa<linalg::MatvecOp>(op) || isa<linalg::Mmt4DOp>(op) ||
-                 hasReductionIterator(linalgOp));
+                 isa<linalg::UnPackOp>(op) || hasReductionIterator(linalgOp));
 }
 
 LogicalResult mlir::linalg::vectorizeOpPrecondition(
@@ -2723,7 +2769,8 @@ FailureOr<VectorizationResult> mlir::linalg::vectorize(
           })
           .Case<linalg::UnPackOp>([&](auto unpackOp) {
             return vectorizeAsTensorUnpackOp(rewriter, unpackOp,
-                                             inputVectorSizes, results);
+                                             inputVectorSizes,
+                                             inputScalableVecDims, results);
           })
           .Case<tensor::InsertSliceOp>([&](auto sliceOp) {
             return vectorizeAsInsertSliceOp(rewriter, sliceOp, inputVectorSizes,
@@ -3114,7 +3161,8 @@ vectorizeAsInsertSliceOp(RewriterBase &rewriter, tensor::InsertSliceOp sliceOp,
       vecType.getRank(), rewriter.create<arith::ConstantIndexOp>(loc, 0));
   Value read = mlir::vector::createReadOrMaskedRead(
       rewriter, loc, source, vecType.getShape(), padValue,
-      /*useInBoundsInsteadOfMasking=*/inputVectorSizes.empty());
+      /*useInBoundsInsteadOfMasking=*/inputVectorSizes.empty(),
+      /*inputScalableVecSizes=*/{});
 
   // Create write
   auto writeIndices =
diff --git a/mlir/lib/Dialect/Vector/Utils/VectorUtils.cpp b/mlir/lib/Dialect/Vector/Utils/VectorUtils.cpp
index c045063e8194f..a379229d3d5a5 100644
--- a/mlir/lib/Dialect/Vector/Utils/VectorUtils.cpp
+++ b/mlir/lib/Dialect/Vector/Utils/VectorUtils.cpp
@@ -281,14 +281,16 @@ vector::createUnrollIterator(VectorType vType, int64_t targetRank) {
   // Attempt to unroll until targetRank or the first scalable dimension (which
   // cannot be unrolled).
   auto shapeToUnroll = vType.getShape().drop_back(targetRank);
-  auto scalableDimsToUnroll = vType.getScalableDims().drop_back(targetRank);
-  auto it = llvm::find(scalableDimsToUnroll, true);
-  auto firstScalableDim = it - scalableDimsToUnroll.begin();
+  auto inputScalableVecDimsToUnroll =
+      vType.getScalableDims().drop_back(targetRank);
+  auto it = llvm::find(inputScalableVecDimsToUnroll, true);
+  auto firstScalableDim = it - inputScalableVecDimsToUnroll.begin();
   if (firstScalableDim == 0)
     return {};
   // All scalable dimensions should be removed now.
-  scalableDimsToUnroll = scalableDimsToUnroll.slice(0, firstScalableDim);
-  assert(!llvm::is_contained(scalableDimsToUnroll, true) &&
+  inputScalableVecDimsToUnroll =
+      inputScalableVecDimsToUnroll.slice(0, firstScalableDim);
+  assert(!llvm::is_contained(inputScalableVecDimsToUnroll, true) &&
          "unexpected leading scalable dimension");
   // Create an unroll iterator for leading dimensions.
   shapeToUnroll = shapeToUnroll.slice(0, firstScalableDim);
@@ -321,15 +323,15 @@ Value vector::createReadOrMaskedRead(OpBuilder &builder, Location loc,
                                      ArrayRef<int64_t> inputVectorSizes,
                                      Value padValue,
                                      bool useInBoundsInsteadOfMasking,
-                                     ArrayRef<bool> scalableDims) {
+                                     ArrayRef<bool> inputScalableVecDims) {
   assert(!llvm::is_contained(inputVectorSizes, ShapedType::kDynamic) &&
          "invalid input vector sizes");
   auto sourceShapedType = cast<ShapedType>(source.getType());
   auto sourceShape = sourceShapedType.getShape();
   assert(sourceShape.size() == inputVectorSizes.size() &&
          "expected same ranks.");
-  auto vectorType =
-      VectorType::get(inputVectorSizes, padValue.getType(), scalableDims);
+  auto vectorType = VectorType::get(inputVectorSizes, padValue.getType(),
+                                    inputScalableVecDims);
   assert(padValue.getType() == sourceShapedType.getElementType() &&
          "expected same pad element type to match source element type");
   int64_t readRank = inputVectorSizes.size();
@@ -358,8 +360,8 @@ Value vector::createReadOrMaskedRead(OpBuilder &builder, Location loc,
           ? memref::getMixedSizes(builder, loc, source)
           : tensor::getMixedSizes(builder, loc, source);
 
-  auto maskType =
-      VectorType::get(inputVectorSizes, builder.getI1Type(), scalableDims);
+  auto maskType = VectorType::get(inputVectorSizes, builder.getI1Type(),
+                                  inputScalableVecDims);
   Value mask =
       vector::CreateMaskOp::create(builder, loc, maskType, mixedSourceDims);
   return mlir::vector::maskOperation(builder, transferReadOp, mask)
diff --git a/mlir/test/Dialect/Linalg/vectorization/linalg-ops.mlir b/mlir/test/Dialect/Linalg/vectorization/linalg-ops.mlir
index 98e8f5079176c..b38d3bdedd52a 100644
--- a/mlir/test/Dialect/Linalg/vectorization/linalg-ops.mlir
+++ b/mlir/test/Dialect/Linalg/vectorization/linalg-ops.mlir
@@ -940,34 +940,113 @@ module attributes {transform.with_named_sequence} {
 ///----------------------------------------------------------------------------------------
 
 // CHECK-LABEL: func @test_vectorize_dynamic_shapes_unpack
-// CHECK-SAME:      %[[ARG_0:.*]]: tensor<?x?xf32>,
-func.func @test_vectorize_dynamic_shapes_unpack(%arg0: tensor<?x?xf32>, %arg1: tensor<?x?x16x2xf32>) -> tensor<?x?xf32> {
-// CHECK: %[[C0:.*]] = arith.constant 0
-// CHECK: %[[DIM:.*]] = tensor.dim %arg0, %[[C0]] : tensor<?x?xf32>
-// CHECK: %[[C1:.*]] = arith.constant 1 : index
-// CHECK: %[[DIM0:.*]] = tensor.dim %arg0, %[[C1]] : tensor<?x?xf32>
-// CHECK: %[[CST:.*]] = arith.constant 0.000000e+00
-// CHECK: %[[C01:.*]] = arith.constant 0
-// CHECK: %[[C02:.*]] = arith.constant 0
-// CHECK: %[[DIM4:.*]] = tensor.dim %arg1, %[[C02]] : tensor<?x?x16x2xf32>
-// CHECK: %[[CNST14:.*]] = arith.constant 1
-// CHECK: %[[DIM6:.*]] = tensor.dim %arg1, %[[CNST14]] : tensor<?x?x16x2xf32>
-// CHECK: %[[CNST16:.*]] = arith.constant 16 : index
-// CHECK: %[[CNST2:.*]] = arith.constant 2 : index
-// CHECK: %[[readMsk0:.*]] = vector.create_mask %[[DIM4]], %[[DIM6]], %[[CNST16]], %[[CNST2]] : vector<2x1x16x2xi1>
-// CHECK: %[[read0:.*]] = vector.mask %[[readMsk0]] {{.*}} vector.transfer_read %{{.*}} : tensor<?x?x16x2xf32>, vector<2x1x16x2xf32> } : vector<2x1x16x2xi1> -> vector<2x1x16x2xf32>
-// CHECK: %[[trans0:.*]] = vector.transpose %[[read0]], [0, 3, 1, 2] : vector<2x1x16x2xf32> to vector<2x2x1x16xf32>
-// CHECK: %[[sc0:.*]] = vector.shape_cast %[[trans0]] : vector<2x2x1x16xf32> to vector<4x16xf32>
-// CHECK: %[[writeMsk0:.*]] = vector.create_mask {{.*}} : vector<4x16xi1>
-// CHECK: %[[write0:.*]] = vector.mask %[[writeMsk0:.*]] {{.*}} vector.transfer_write %[[sc0]], %[[ARG_0]]
-// CHECK: return %[[write0]]
- %ret = linalg.unpack %arg1 inner_dims_pos = [1, 0] inner_tiles = [16, 2] into %arg0 : tensor<?x?x16x2xf32> -> tensor<?x?xf32>
- return %ret : tensor<?x?xf32>
+// CHECK-SAME:      %[[DEST:.*]]: tensor<?x?xf32>,
+// CHECK-SAME:      %[[SRC:.*]]: tensor<?x?x16x2xf32>
+func.func @test_vectorize_dynamic_shapes_unpack(%dest: tensor<?x?xf32>, %src: tensor<?x?x16x2xf32>) -> tensor<?x?xf32> {
+  // CHECK: %[[C0:.*]] = arith.constant 0
+  // CHECK: %[[DIM:.*]] = tensor.dim %[[DEST]], %[[C0]] : tensor<?x?xf32>
+  // CHECK: %[[C1:.*]] = arith.constant 1 : index
+  // CHECK: %[[DIM0:.*]] = tensor.dim %[[DEST]], %[[C1]] : tensor<?x?xf32>
+  // CHECK: %[[CST:.*]] = arith.constant 0.000000e+00
+  // CHECK: %[[C01:.*]] = arith.constant 0
+  // CHECK: %[[C02:.*]] = arith.constant 0
+  // CHECK: %[[DIM4:.*]] = tensor.dim %[[SRC]], %[[C02]] : tensor<?x?x16x2xf32>
+  // CHECK: %[[CNST14:.*]] = arith.constant 1
+  // CHECK: %[[DIM6:.*]] = tensor.dim %[[SRC]], %[[CNST14]] : tensor<?x?x16x2xf32>
+  // CHECK: %[[CNST16:.*]] = arith.constant 16 : index
+  // CHECK: %[[CNST2:.*]] = arith.constant 2 : index
+  // CHECK: %[[MASK_READ:.*]] = vector.create_mask %[[DIM4]], %[[DIM6]], %[[CNST16]], %[[CNST2]] : vector<2x1x16x2xi1>
+  // CHECK: %[[READ:.*]] = vector.mask %[[MASK_READ]] {{.*}} vector.transfer_read %{{.*}} : tensor<?x?x16x2xf32>, vector<2x1x16x2xf32> } : vector<2x1x16x2xi1> -> vector<2x1x16x2xf32>
+  // CHECK: %[[TR:.*]] = vector.transpose %[[READ]], [0, 3, 1, 2] : vector<2x1x16x2xf32> to vector<2x2x1x16xf32>
+  // CHECK: %[[SC:.*]] = vector.shape_cast %[[TR]] : vector<2x2x1x16xf32> to vector<4x16xf32>
+  // CHECK: %[[MASK_WRITE:.*]] = vector.create_mask {{.*}} : vector<4x16xi1>
+  // CHECK: %[[WRITE:.*]] = vector.mask %[[MASK_WRITE:.*]] {{.*}} vector.transfer_write %[[SC]], %[[DEST]]
+  // CHECK: return %[[WRITE]]
+  %ret = linalg.unpack %src inner_dims_pos = [1, 0] inner_tiles = [16, 2] into %dest : tensor<?x?x16x2xf32> -> tensor<?x?xf32>
+  return %ret : tensor<?x?xf32>
 }
 module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%arg0: !transform.any_op {transform.readonly}) {
    %0 = transform.structured.match ops{["linalg.unpack"]} in %arg0 : (!transform.any_op) -> !transform.any_op
-   transform.structured.vectorize %0 vector_sizes [4, 16] : !transform.any_op
+   transform.structured.vectorize %0 vector_sizes [2, 1, 16, 2, 4, 16] : !transform.any_op
+   transform.yield
+ }
+}
+
+// -----
+
+// CHECK-LABEL: func @test_vectoriz...
[truncated]

@banach-space banach-space requested a review from egebeysel July 24, 2025 09:15
@banach-space banach-space changed the title from "[mlir][linalg] Enable scalable vectorization of linalg.unpack (WIP)" to "[mlir][linalg] Enable scalable vectorization of linalg.unpack" on Jul 24, 2025
Contributor

@egebeysel egebeysel left a comment

Thanks for this PR! I think the logic to extend the input vector sizes from write sizes to read & write sizes looks good, and to my understanding it also makes sense for enabling scalable vectorization. I've tested the WIP of this PR on downstream IREE and it seemed to serve what I was trying to do 😃 I'm not very familiar with the linalg vectorization, so I might've left more questions than comments, sorry about that! 🥲 Generally, though, I think more comments in the code on what this change enables and on the behaviour for user-specified sizes that are not "suitable" (whether it's UB, some later check fails, etc.) would help readers understand the behaviour better.

SmallVector<int64_t> vectorSizes(inputVectorSizes);
if (vectorSizes.empty()) {
llvm::append_range(vectorSizes, sourceShape.take_front(destSize));
// SmallVector<int64_t> writeVectorSizes(inputVectorSizes);
Contributor

I think this is forgotten here :)

Contributor

Also, can we add a comment here saying that this is the case that we would be inferring the write vector sizes from the IR? Thanks!

Contributor Author

I think this is forgotten here :)

Thanks, let me remove that!

Contributor Author

Also, can we add a comment here saying that this is the case that we would be inferring the write vector sizes from the IR? Thanks!

Sure. In fact, I've rewritten that comment (and the related logic) in this commit (it wasn't clear to me how to expand it cleanly).

// vectorSizes is the shape of the vector that will be used to do final
// Split input-vector-sizes into vector sizes for the read and write
// operations.
if (!inputVectorSizes.empty()) {
Contributor

One thing I'm curious about: even if we have fully static sizes and inner tiles, if the user specifies some size, we use that one. I think this was the previous behaviour as well, but should we maybe add some sort of sanity check for this? I'm not really familiar with the current behaviour either, so this is rather a question of why things are the way they are :)

Contributor Author

Thanks for these questions!

even if we have fully static sizes and inner tiles, if the user specifies some size, we use that one

If a "user" provides the configuration, the compiler ought to respect that. Otherwise, we could have a situation where a user provides some input and that's unexpectedly ignored by the compiler. (*)

Also, if a "user" provides the configuration, it is similar to saying "Hey, compiler, I know better, follow my instructions! And ignore the static loop bounds that you may have access to - like I said - I know better." :)

Note that:

  • If the user-specified vectors are too large, masking is used to make sure there are no out-of-bounds accesses.
  • If the user-specified vectors are smaller than the actual run-time input then there won't be any out-of-bounds accesses anyway.

Hopefully this makes sense :)
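
To illustrate the masking point in the note above (a hand-written sketch with assumed shapes, not code from this patch), an over-sized read vector is guarded by a mask derived from the runtime tensor sizes:

```mlir
func.func @masked_read_sketch(%src: tensor<?x?x16x2xf32>) -> vector<2x1x16x2xf32> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c16 = arith.constant 16 : index
  %c2 = arith.constant 2 : index
  %pad = arith.constant 0.0 : f32
  // The mask extents come from the runtime dims, so lanes past the actual
  // tensor extent are disabled and no out-of-bounds access happens.
  %d0 = tensor.dim %src, %c0 : tensor<?x?x16x2xf32>
  %d1 = tensor.dim %src, %c1 : tensor<?x?x16x2xf32>
  %mask = vector.create_mask %d0, %d1, %c16, %c2 : vector<2x1x16x2xi1>
  %read = vector.mask %mask {
    vector.transfer_read %src[%c0, %c0, %c0, %c0], %pad
      : tensor<?x?x16x2xf32>, vector<2x1x16x2xf32>
  } : vector<2x1x16x2xi1> -> vector<2x1x16x2xf32>
  return %read : vector<2x1x16x2xf32>
}
```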

should we maybe add some sort of sanity check for this?

Yes, I have updated the pre-condition calculation in this commit. Can we do more and/or better? 🤔

(*) One option would be for the compiler to issue some diagnostic saying that the user input was ignored. However, I personally feel that we should avoid such high-level complexities.

Contributor

Yes, we should verify if it has valid input vectors or not in precondition check, like the linalg one.

@banach-space banach-space force-pushed the users/banach-space/linalg/unpack_vec_update branch 2 times, most recently from 7151b93 to b389499 on July 25, 2025 10:46
@banach-space banach-space changed the base branch from main to users/banach-space/linalg/update_vec July 25, 2025 10:47
@banach-space
Contributor Author

I'm not very familiar with the linalg vectorization, so I might've left more questions than comments, sorry about that!

No need to apologize! You took the time to review my PR, and it’s totally expected that you'd have questions - I'm more than happy to clarify things :)

This code should be clear to everyone, not just those who've worked on it before, so thanks for highlighting areas that could use more explanation. I’m happy to expand further if anything’s still unclear.

I’ve pushed quite a few changes and added links to the relevant commits in the comments. That’s introduced a bit of noise, though, so it might be easier to review this as one large change rather than commit-by-commit.

Thanks again for taking a look 🙏🏻

Contributor

@hanhanW hanhanW left a comment

I was thinking about the same issue, i.e., requiring input vector sizes for both read and write, when I worked on pad op vectorization. I was considering providing two vector lists, but appending one to the other looks okay to me, because these are the op's implementation details, like tile sizes in a tiling implementation.

Thanks for the patch, and I'll spend some time to integrate this into IREE: https://github.com/iree-org/iree/blob/a260a5e4c3033ed2aa35498865b856e68340b7dc/compiler/src/iree/compiler/Codegen/Utils/Utils.cpp#L1784C15-L1822

This patch addresses the issue by requiring explicit vector sizes for both the read and write sides, enabling scalable vectorization in such cases.

We should document it somewhere. I haven't found a good place, but maybe in vectorizeUnPackOpPrecondition's function comment? I think it is better than nothing.

// vectorSizes is the shape of the vector that will be used to do final
// Split input-vector-sizes into vector sizes for the read and write
// operations.
if (!inputVectorSizes.empty()) {
Contributor

Yes, we should verify if it has valid input vectors or not in precondition check, like the linalg one.

if (!inputVectorSizes.empty()) {
if (inputVectorSizes.size() !=
unpackOp.getDestRank() + unpackOp.getSourceRank()) {
LDBG("Incorrect number of input vector sizes");
Contributor

FYI, you'll need to switch to LDBG() << "Incorrect..." after rebase. see 07967d4

@banach-space banach-space force-pushed the users/banach-space/linalg/update_vec branch from 47f7f4a to 0e08666 on July 28, 2025 07:38
Base automatically changed from users/banach-space/linalg/update_vec to main July 28, 2025 08:00
@banach-space banach-space force-pushed the users/banach-space/linalg/unpack_vec_update branch from cae3fb4 to 1d9f737 on July 28, 2025 08:01

github-actions bot commented Jul 28, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

Contributor

hanhanW commented Jul 30, 2025

Yes, +1 to one self-contained change. I've been thinking about it this morning, and chatted with @banach-space.

TLDR: I'm convinced that the read approach is better.

I feel that the root cause is that we don't know what the scalable size is for an inner tile; the read approach provides such information. For both approaches, we still need to do an analysis somewhere to compute the vector shapes for read/write, if there are scalable inner tiles.

It is reasonable to use the write approach if the inference is simple, e.g., with static inner tiles. It may add too much burden to vectorization itself if there are scalable inner tiles.

The read approach is more like moving the expensive analysis outside vectorization. It is the user's choice how to provide vector shapes. The read approach isolates the analysis and vectorization better, I think.

We shouldn't expect the vectorizer to do any "guessing".
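
In IR terms (reusing the snippet from the PR description), the inner tile size is only known as a vscale-based runtime value, so the write sizes alone cannot be divided by it at vectorization time:

```mlir
func.func @scalable_tile_size() -> index {
  // The inner tile is "8 x vscale": its numeric value exists only at runtime,
  // so inferring read sizes from write sizes would require dividing by it.
  %vs = vector.vscale
  %c8 = arith.constant 8 : index
  %tile_size = arith.muli %vs, %c8 : index
  return %tile_size : index
}
```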

Base automatically changed from users/banach-space/remove-dead-code to main July 30, 2025 19:34
banach-space added a commit that referenced this pull request Jul 31, 2025
…nalg.unpack

This patch introduces a new helper, `getCollapsedVecType`, and updates
`vectorizeAsTensorUnpackOp` to use it. The motivation stems from improving
how `vector.shape_cast` operations are generated when vectorizing
`linalg.unpack`.

Previously, the vectorizer relied on
* `tensor::CollapseShapeOp::inferCollapsedType`

to compute the collapsed vector type. This approach is suboptimal
because:
  * `inferCollapsedType` lacks awareness of scalable vector flags.
  * Linalg vectorization should not depend on Tensor dialect utilities.

Instead of relocating `inferCollapsedType`, we introduce
`getCollapsedVecType` — a lightweight, specialized hook that:
  * Assumes no dynamic sizes.
  * Handles scalable flags alongside shape dimensions.

This change also reduces temporary variables in
`vectorizeAsTensorUnpackOp` and paves the way for a cleaner update in
 #149293.
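
As a hedged illustration of what the scalable-aware collapse enables (hand-written, not taken from the commit; the shapes are hypothetical), the shape_cast produced for a scalable inner tile keeps the scalable flag on the collapsed dimension:

```mlir
// Hypothetical shapes: collapsing groups {[0, 1], [2, 3]} of a transposed
// unpack value whose trailing tile dimension is scalable.
func.func @collapse_scalable(%tr: vector<2x2x1x[16]xf32>) -> vector<4x[16]xf32> {
  %sc = vector.shape_cast %tr : vector<2x2x1x[16]xf32> to vector<4x[16]xf32>
  return %sc : vector<4x[16]xf32>
}
```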
@banach-space banach-space force-pushed the users/banach-space/linalg/unpack_vec_update branch from 56108b1 to 97fbccd on July 31, 2025 12:18
@banach-space banach-space changed the base branch from main to users/banach-space/update-vec-unpack July 31, 2025 12:18
@banach-space
Contributor Author

UPDATE: 31/7/25

I’m quite happy with the current state - vectorizeAsTensorUnpackOp is much simpler now. Many thanks for the discussion, @hanhanW 🙏🏻

Also, kudos to @egebeysel - your comment prompted me to rethink the approach. That's the kind of review feedback every contributor hopes for!

@egebeysel
Contributor

Also, kudos to @egebeysel - your comment prompted me to rethink the approach. That's the kind of review feedback every contributor hopes for!

I'm glad to have helped and thanks for the updates! 😄 I'm back from a break, I'll try and take another look at this tomorrow, if not today.

Contributor

@hanhanW hanhanW left a comment

LGTM, thanks! Just 3 nits.

Comment on lines 1901 to 1903
/// ```
/// vector.transfer_write %sc into %dest : vector<8x8xf32>, tensor<8x8xf32>
/// ```
Contributor

You deleted a few ops in b073854. I think we should add the xfer_read, transpose and shape_cast back.

vector::TransposeOp transposeOp = vector::TransposeOp::create(
rewriter, loc, readResult, lastDimToInsertPosPerm);

// Collapse the vector to the size required by result.
// 3. Generate the shape_cast operation.
Contributor

This is why I don't like labeling numbers. This should be 4 and below should be 5. :)

Contributor Author

The numbers are not important, so let me remove them and use some other method to highlight the key parts :)

banach-space added a commit that referenced this pull request Aug 1, 2025
…nalg.unpack

This patch introduces a new helper, `getCollapsedVecType`, and updates
`vectorizeAsTensorUnpackOp` to use it. The motivation stems from improving
how `vector.shape_cast` operations are generated when vectorizing
`linalg.unpack`.

Previously, the vectorizer relied on
* `tensor::CollapseShapeOp::inferCollapsedType`

to compute the collapsed vector type. This approach is suboptimal
because:
  * `inferCollapsedType` lacks awareness of scalable vector flags.
  * Linalg vectorization should not depend on Tensor dialect utilities.

Instead of relocating `inferCollapsedType`, we introduce
`getCollapsedVecType` — a lightweight, specialized hook that:
  * Assumes no dynamic sizes.
  * Handles scalable flags alongside shape dimensions.

This change also reduces temporary variables in
`vectorizeAsTensorUnpackOp` and paves the way for a cleaner update in
 #149293.
@banach-space banach-space force-pushed the users/banach-space/update-vec-unpack branch from 7fc78d2 to 4d6cb14 on August 1, 2025 10:15
banach-space added a commit that referenced this pull request Aug 1, 2025
…nalg.unpack (#151503)

This patch introduces a new helper, `getCollapsedVecType`, and updates
`vectorizeAsTensorUnpackOp` to use it. The motivation stems from improving how
`vector.shape_cast` operations are generated when vectorizing `linalg.unpack`.

Previously, the vectorizer relied on
* `tensor::CollapseShapeOp::inferCollapsedType`

to compute the collapsed vector type. This approach is suboptimal
because:
  * `inferCollapsedType` lacks awareness of scalable vector flags.
  * Linalg vectorization should not depend on Tensor dialect utilities.

Instead of relocating `inferCollapsedType`, we introduce
`getCollapsedVecType` — a lightweight, specialized hook that:
  * Assumes no dynamic sizes.
  * Handles scalable flags alongside shape dimensions.

This change also reduces temporary variables in
`vectorizeAsTensorUnpackOp` and paves the way for a cleaner update in
 #149293.
Base automatically changed from users/banach-space/update-vec-unpack to main August 1, 2025 10:26
This patch updates `vectorizeAsTensorUnpackOp` to support scalable
vectorization by requiring user-specified vector sizes for both the
_read_ and _write_ operations involved in `linalg.unpack`. Detailed
rationale and an example are provided below.

Conceptually, `linalg.unpack` consists of the following high-level steps:
  1. _Read_ from the source tensor.
  2. Transpose the value read in step (1).
  3. _Write_ the value from step (2) into the destination tensor.

Currently, when vectorizing with user-provided vector sizes, only the
sizes for the _write_ operation (step 3) are required. Sizes for the
_read_ operation (step 1) are inferred from static shapes and inner tile
sizes.

This logic breaks when the input shapes or tile sizes are dynamic
(indeed, `vectorizeUnPackOpPrecondition` rejects such cases ATM and the
vectorization fails). This patch addresses the issue by requiring
explicit vector sizes for both the read and write sides, enabling
scalable vectorization in such cases.

Example:

```mlir
func.func @unpack(%in: tensor<1x1x8x?xf32>, %out: tensor<8x?xf32>) -> tensor<8x?xf32> {
  %vs = vector.vscale
  %c8 = arith.constant 8 : index
  %tile_size = arith.muli %vs, %c8 : index

  %unpack = linalg.unpack  %in
    inner_dims_pos = [0, 1]
    inner_tiles = [8, %tile_size]
    into %out : tensor<1x1x8x?xf32> -> tensor<8x?xf32>
  return %unpack : tensor<8x?xf32>
}

module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%arg0: !transform.any_op {transform.readonly}) {
    %0 = transform.structured.match ops{["linalg.unpack"]} in %arg0 : (!transform.any_op) -> !transform.any_op
    transform.structured.vectorize %0 vector_sizes [1, 1, 8, [8],  8, [8]] : !transform.any_op
    //                                              \         /    \    /
    //                                              read-sizes   write-sizes
    transform.yield
  }
}
```

Finally, this patch also extends `createReadOrMaskedRead` and
`createWriteOrMaskedWrite` to take scalable flags.
…unpack (WIP)

Improve documentation + fix test after rebasing on top of
* #150602
…linalg.unpack (WIP)

Remove unintended test change
…torization of linalg.unpack (WIP)

Fix comment
@banach-space banach-space force-pushed the users/banach-space/linalg/unpack_vec_update branch from 97fbccd to 6548876 on August 1, 2025 10:29
Contributor

@egebeysel egebeysel left a comment

LGTM

hanhanW pushed a commit to iree-org/llvm-project that referenced this pull request Aug 1, 2025
…nalg.unpack (llvm#151503)

This patch introduces a new helper, `getCollapsedVecType`, and updates
`vectorizeAsTensorUnpackOp` to use it. The motivation stems from improving how
`vector.shape_cast` operations are generated when vectorizing `linalg.unpack`.

Previously, the vectorizer relied on
* `tensor::CollapseShapeOp::inferCollapsedType`

to compute the collapsed vector type. This approach is suboptimal
because:
  * `inferCollapsedType` lacks awareness of scalable vector flags.
  * Linalg vectorization should not depend on Tensor dialect utilities.

Instead of relocating `inferCollapsedType`, we introduce
`getCollapsedVecType` — a lightweight, specialized hook that:
  * Assumes no dynamic sizes.
  * Handles scalable flags alongside shape dimensions.

This change also reduces temporary variables in
`vectorizeAsTensorUnpackOp` and paves the way for a cleaner update in
 llvm#149293.
@banach-space banach-space merged commit 3692c73 into main Aug 6, 2025
9 checks passed
@banach-space banach-space deleted the users/banach-space/linalg/unpack_vec_update branch August 6, 2025 19:37